Pick one software call, step through each hardware block, and track both control ownership and data movement. Each panel starts with plain-language behavior, then maps that behavior to the low-level details used during embedded Linux bring-up and debugging.
idle
Unified call flow across CPU, GPU, DMA, RAM, MMU, and IOMMU
Run one scenario at a time and track who owns control and where data moves. This is where you can catch coherence bugs, translation faults, DMA stalls, and PCIe (Peripheral Component Interconnect Express) backpressure.
Flow guidance: pick a scenario and follow the highlighted block. The active step tells you who owns control and which unit is moving bytes.
scenario runs
0
fault events
0
dma transactions
0
estimated latency
0 us
Current call: idle
Control ownership: idle
Data path: idle
How to use this map: start with one call, then walk stage by stage. When something is unclear, open the focused panel for that topic (cache, coherency, IOMMU, DMA sync, CAS, barriers, or bus timing).
Cache policy and hierarchy behavior (L1 focus, with L2/L3 context)
Use this as a cache math lab. Split one address into tag (which memory line), index (which set), and offset (which byte in the line). Then run traffic shapes to see locality, conflict thrash, and how write policy changes DRAM pressure.
Write policy:
On write miss:
Addr (hex):
Preset traffic:
tag: bits 31..9
index: bits 8..6
offset: bits 5..0
Decoded: tag=0x00028, set=0x5, offset=0x00
Address split with formulas: for a 64-byte line and 8 sets, tag = addr >> 9, set = (addr >> 6) & 0x7, offset = addr & 0x3F. tag tells which memory line this is, set picks one cache row, and offset picks the byte inside the 64-byte line.
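The split formulas above can be written out directly. A minimal sketch of the 8-set, 64-byte-line teaching model (the function names are illustrative, not a real API):

```c
#include <stdint.h>

/* Teaching-model split for a 64-byte line, 8-set cache:
 * offset = addr[5:0], set = addr[8:6], tag = addr[31:9]. */
static inline uint32_t cache_tag(uint32_t addr)    { return addr >> 9; }
static inline uint32_t cache_set(uint32_t addr)    { return (addr >> 6) & 0x7; }
static inline uint32_t cache_offset(uint32_t addr) { return addr & 0x3F; }
```

Feeding in addr = 0x5140 reproduces the decoded example (tag 0x28, set 5, offset 0), and a stride of 0x200 changes the tag while leaving the set unchanged, which is exactly the thrash pattern.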
Policy quick read: Write-back (WB) keeps dirty lines local until eviction. Write-through (WT) sends every store to DRAM. Write-allocate (WA) fetches missed lines before write; no-write-allocate (NWA) writes around the cache on miss.
What each cache table column means: set is the index-selected row. way0..way3 are the four slots checked in parallel. Each slot prints tag; D means dirty (modified in cache, not written back yet). A hit means one way in that row has both valid=1 and matching tag.
Why the hardware diagram uses different bit ranges: the top diagram uses a 48-bit physical-address style notation (tag[47:14], index[13:6]) to show real hardware wiring. The interactive table below is intentionally simplified to an 8-set teaching model (index[8:6]) so conflicts are easy to see.
Sequential walk: addresses increase in order. Early accesses miss because lines are not loaded yet, then hit rate improves because nearby bytes share the same cache line (spatial locality).
Thrash (stride 0x200): stride 0x200 keeps the same set index while changing the tag. Too many lines compete for one set, so they evict each other even though total cache capacity looks large enough.
Streaming reader / write-heavy logger: a streaming reader touches many lines once, so temporal reuse is weak. A write-heavy logger reuses a smaller hot region, so hit rate rises, but dirty-line eviction can become the bottleneck.
Random pointer chase: next addresses are data-dependent and scattered. Both spatial and temporal locality are weak, so misses stay high and policy choices (WT/NWA vs WB/WA) become visible in latency and DRAM traffic.
hits
0
misses
0
writebacks
0
dram writes
0
hit rate
n/a
Write-back + write-allocate (default): on a store miss, the core first fetches the line into cache. Stores mark the line dirty, and DRAM is updated later on eviction. This reduces DRAM traffic, but dirty data is not durable until writeback, which is why sync() and cache flushes matter.
MESI (Modified, Exclusive, Shared, Invalid) with real values and bus messages
Use one shared address X with two cores and private caches. Trigger loads and stores from each core, then watch values, state transitions, bus messages, and DRAM updates in order.
MESI (Modified, Exclusive, Shared, Invalid)
Four-state tag attached to every cache line: Modified (I own the only dirty copy, RAM is stale), Exclusive (clean and mine alone, RAM matches), Shared (clean, other caches may also hold it), Invalid (line is empty or stale, must refetch). The snoop controller in each L1 watches the coherent bus and flips its own state when another core acts on the same physical address.
BusRd / BusRdX / Invalidate
The three bus messages that drive MESI. BusRd = "I want to read this line, anyone have it?" BusRdX = "I want to write it, give me exclusive ownership." Invalidate = "I already have it shared, drop your copy so I can write." On the Arm Cortex-A78AE this traffic travels over the DSU (DynamIQ Shared Unit) at the L2 boundary.
Write value:
Reading the state column: M means the only valid copy is dirty. E means the only valid copy is clean. S means multiple caches hold clean copies. I means no valid copy in that cache. State changes are driven by bus messages: BusRd (shared read), BusRdX (exclusive write ownership), BusUpgr (the Invalidate message above: shared to modified without a data fetch), and Flush (writeback to DRAM).
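The state column can be captured as a small transition function. A teaching sketch only (mesi_next and the event names are illustrative; the Flush that must accompany leaving M is noted in comments, not modeled):

```c
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_t;
typedef enum { EV_LOCAL_READ, EV_LOCAL_WRITE,
               EV_SNOOP_BUSRD, EV_SNOOP_BUSRDX } mesi_ev_t;

/* Next state for one line in one cache. `shared` is nonzero when
 * another cache also holds the line and answers the BusRd. */
mesi_t mesi_next(mesi_t s, mesi_ev_t ev, int shared) {
    switch (ev) {
    case EV_LOCAL_READ:                 /* a miss issues BusRd */
        return (s == MESI_I) ? (shared ? MESI_S : MESI_E) : s;
    case EV_LOCAL_WRITE:                /* issues BusRdX or BusUpgr first */
        return MESI_M;
    case EV_SNOOP_BUSRD:                /* M flushes dirty data, then shares */
        return (s == MESI_M || s == MESI_E) ? MESI_S : s;
    case EV_SNOOP_BUSRDX:               /* drop the copy; M flushes first */
        return MESI_I;
    }
    return s;
}
```

Walking two cores through a load/store sequence with this function reproduces the panel's state column: a cold read lands in E, a second core's read demotes it to S, and a store from either side drives M locally and I remotely.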
DMA (Direct Memory Access): doorbells, descriptor rings, and cache ownership
In most drivers the protocol is: CPU writes a descriptor ring in RAM, rings a doorbell register, then hardware issues bus reads and writes (AXI on SoC, PCIe TLPs for external devices), and finally raises interrupts. This panel shows each stage, ownership handoff, and why cache sync is required in streaming mappings.
DMA control protocol (what actually happens)
1) Driver writes descriptor entries (address, length, flags, ownership). 2) Driver rings a doorbell register so the DMA engine starts. 3) DMA engine fetches descriptors over the bus (AXI read on SoC, MRd/CplD pair on PCIe). 4) DMA engine transfers payload data between device and RAM. 5) DMA engine raises an interrupt so software can reclaim buffers. In double-buffer and scatter-gather modes, this repeats per descriptor.
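Steps 1 and 2 can be sketched as descriptor-ring bookkeeping. This is a teaching model, not a real driver: the struct layout, DESC_OWN_HW bit, and function names are all illustrative, and the barrier-plus-doorbell write that a real driver would issue is only noted in a comment:

```c
#include <stdint.h>

#define RING_SIZE   4
#define DESC_OWN_HW (1u << 31)   /* set: hardware owns this descriptor */

struct desc {
    uint64_t addr;    /* DMA (bus) address of the payload buffer */
    uint32_t len;
    uint32_t flags;   /* ownership + control bits */
};

/* Driver side: post one buffer at the producer index.
 * Returns 0 on success, -1 if hardware still owns the slot (ring full). */
int ring_post(struct desc *ring, unsigned *prod, uint64_t addr, uint32_t len) {
    struct desc *d = &ring[*prod % RING_SIZE];
    if (d->flags & DESC_OWN_HW)
        return -1;                 /* backpressure: wait for a completion */
    d->addr  = addr;
    d->len   = len;
    d->flags = DESC_OWN_HW;        /* hand ownership to hardware last */
    /* real driver: dma_wmb(); writel(*prod, doorbell_reg); */
    (*prod)++;
    return 0;
}

/* Model of the engine completing one descriptor and returning ownership. */
void ring_complete(struct desc *ring, unsigned *cons) {
    ring[*cons % RING_SIZE].flags &= ~DESC_OWN_HW;
    (*cons)++;
}
```

The ownership bit is written last on purpose: once DESC_OWN_HW is visible, the engine may fetch the descriptor at any time, so the address and length must already be in place.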
Interrupt flags used here
HT = Half Transfer, TC = Transfer Complete, TE = Transfer Error, FE = FIFO Error. These are status bits your interrupt handler checks before it decides whether to queue the next buffer, retry, or reset the engine.
Mapping:
Mode:
Coherent mapping: allocation comes from a non-cacheable region. CPU writes bypass L1 and land in DRAM immediately, so the device sees current data without explicit cache sync. This is common for descriptor rings and doorbell shadow data. For streaming buffers, ownership must be handed off explicitly with dma_sync_single_for_device() before DMA and dma_sync_single_for_cpu() after DMA.
This panel follows one translation chain end to end: VA (Virtual Address) to PA (Physical Address) for CPU loads, and bus/device addresses to host memory for DMA. In virtual machines, the chain becomes GVA (Guest Virtual Address) → IPA (Intermediate Physical Address) → HPA (Host Physical Address).
Process VA (MMU walk)
HPA view: stage-2 off, so HPA = PA.
IOMMU / stage-2
TTBR0_EL1 means Translation Table Base Register 0 at EL1. It points to the start address of the current process's stage-1 L0 table (the first page-table level used by the CPU walker).
L0/L1/L2/L3/off fields are the VA slices consumed by each walk step: L0 selects an entry in the L0 table, that entry points to L1, L1 points to L2, and L2 points to L3. The L3 entry provides the page frame number; off[11:0] picks the byte inside the 4 KB page.
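The slice arithmetic above can be written out for the common 4 KB-granule, 48-bit VA configuration (each level consumes 9 bits; the function names here are illustrative):

```c
#include <stdint.h>

/* 4 KB granule, 48-bit VA: L0..L3 each index a 512-entry table,
 * the low 12 bits select the byte inside the page. */
static inline unsigned va_l0(uint64_t va)  { return (va >> 39) & 0x1FF; }
static inline unsigned va_l1(uint64_t va)  { return (va >> 30) & 0x1FF; }
static inline unsigned va_l2(uint64_t va)  { return (va >> 21) & 0x1FF; }
static inline unsigned va_l3(uint64_t va)  { return (va >> 12) & 0x1FF; }
static inline unsigned va_off(uint64_t va) { return (unsigned)(va & 0xFFF); }
```

Composing a VA from known indices and extracting them back is a quick sanity check when decoding translation-fault addresses by hand.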
ASID stands for Address Space Identifier. The TLB key is typically {ASID, VPN}, so mappings from different processes can coexist without full TLB flushes on every context switch.
IOMMU side maps device-visible addresses (IOVA) into HPA (Host Physical Address). This bounds DMA to approved ranges and blocks rogue descriptors from writing outside the assigned memory window.
Stage-2 cost: without virtualization, one TLB miss needs up to four DRAM reads (one per page-table level). With stage-2 enabled, those reads can themselves need translation, so a cold miss may require many more memory accesses. Hardware handles this with nested walkers: AMD NPT (Nested Page Tables), Intel EPT (Extended Page Tables), and ARM stage-2.
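The worst-case count follows the standard nested-walk formula: each of the n1 stage-1 descriptor reads, plus the final output address, needs its own n2-level stage-2 walk. A sketch (walk_cost is an illustrative name):

```c
/* Worst-case memory accesses for one cold TLB miss with n1 stage-1
 * levels and n2 stage-2 levels: (n1 + 1) * (n2 + 1) - 1.
 * With n2 = 0 (no virtualization) this reduces to n1. */
int walk_cost(int n1, int n2) {
    return (n1 + 1) * (n2 + 1) - 1;
}
```

For the 4-level/4-level case this gives 24 accesses against 4 without stage-2, which is why walk caches and large TLBs matter so much under virtualization.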
Memory ordering: two tests that separate the models
The flag handshake separates relaxed ordering from stronger modes. To compare acquire/release with seq_cst, run the store-buffer (Dekker-style) test: each thread writes one variable, then reads the other. Without a full fence, both reads can legally return zero on weakly ordered systems.
Test:
Ordering:
Flag handshake: producer writes data, then raises flag. Consumer spins on flag, then reads data.
Store buffer test: each core stores to one variable, then loads the other. Weak ordering may let both reads see old zeros.
runs
0
failures
0
fail rate
n/a
Pick a test and memory ordering, then run it. The result shows how weakly ordered CPUs (ARM, RISC-V) can reorder operations unless you use stronger ordering or fences.
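The store-buffer test maps directly onto C11 atomics. A hedged sketch (sb_failures is an illustrative name): with memory_order_seq_cst the both-zero outcome is forbidden by the language, so the failure counter must stay at zero; downgrading both operations to memory_order_relaxed makes nonzero counts possible on weakly ordered hardware:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

static atomic_int x, y;
static int r1, r2;

static void *writer_x(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);
    return NULL;
}
static void *writer_y(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
    return NULL;
}

/* Run the store-buffer litmus test `iters` times and count runs
 * where both loads returned 0 (forbidden under seq_cst). */
int sb_failures(int iters) {
    int fails = 0;
    for (int i = 0; i < iters; i++) {
        pthread_t a, b;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        r1 = r2 = -1;
        pthread_create(&a, NULL, writer_x, NULL);
        pthread_create(&b, NULL, writer_y, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            fails++;
    }
    return fails;
}
```

Note that a passing relaxed run proves nothing: the reordering is permitted, not required, so absence of failures on one machine is never evidence of correctness.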
Spinlock, mutex, priority inversion, and compare-and-swap
These scenarios compare spinlock and mutex behavior, priority inversion, condition variables, and lock-free CAS updates. The goal is practical debugging intuition: who is running, who is waiting, and why progress has stopped.
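A CAS-based spinlock is the smallest of these primitives and makes the acquire/release pairing visible. A minimal sketch in C11 atomics (the names spin_lock/run_counter_test are illustrative; real kernels add backoff, fairness, and preemption control):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_int lck;          /* 0 = free, 1 = held */

static void spin_lock(void) {
    int expected = 0;
    /* CAS loop: claim the lock only if it is currently free. */
    while (!atomic_compare_exchange_weak_explicit(
               &lck, &expected, 1,
               memory_order_acquire, memory_order_relaxed))
        expected = 0;           /* CAS wrote back the observed value; retry */
}

static void spin_unlock(void) {
    atomic_store_explicit(&lck, 0, memory_order_release);
}

static long counter;            /* plain variable, protected by the lock */

static void *worker(void *arg) {
    for (int i = 0; i < 10000; i++) {
        spin_lock();
        counter++;
        spin_unlock();
    }
    return arg;
}

/* Hammer the lock from several threads; the total must be exact. */
long run_counter_test(int nthreads) {
    pthread_t t[8];
    counter = 0;
    for (int i = 0; i < nthreads; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++) pthread_join(t[i], NULL);
    return counter;
}
```

The acquire on lock and release on unlock are what make the plain `counter++` safe: they order the critical section against the lock word, which is the same pairing the barrier panels below discuss.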
C source → assembly → pipeline stages
Start from a short C snippet, inspect an ARM-style lowering, then step through a 5-stage pipeline (fetch, decode, execute, memory, writeback). Compare the cycle cost of an L1 miss versus a DRAM miss.
Snippet:
C source
Compiled asm (AArch64-ish)
F = Fetch: reads instruction bytes from the I-cache using the PC.
D = Decode: decodes the opcode and reads source registers.
E = Execute: ALU does math, branch compare, or address generation.
M = Memory: loads/stores data through the D-cache.
W = Writeback: writes the result back to the register file.
Pipeline timeline (cycle 0)
CPU datapath circuit view
What happens on a load miss: Execute computes the address, Memory requests L1, and L1 misses. Any dependent instruction waits for data. Real out-of-order cores may continue with independent work, but this teaching model is in-order so dependencies are easy to see.
NVIDIA GPU warp, shared memory, and __syncthreads()
This model follows NVIDIA SM execution: a warp has 32 lockstep threads, shared memory is banked, and bank conflicts serialize access. __syncthreads() is a block-wide barrier; misuse can cause stale reads or deadlock.
Access pattern:
Coalesced pattern: thread t reads base + t*4. Threads touch adjacent addresses, so the warp can issue one efficient memory transaction. This is the ideal access pattern.
Firmware notes: warp execution and bank conflicts
A warp is the smallest scheduling unit on an NVIDIA SM. All 32 threads execute one instruction together. If branches diverge (for example, 16 threads go one way and 16 go the other), the SM runs both paths serially and reconverges. In practice you see this as lower warp_execution_efficiency in Nsight Compute.
Shared memory has 32 banks that are 4 bytes wide. A bank conflict occurs when multiple threads in the same warp hit the same bank on different addresses. The hardware serializes those accesses. Common fixes are changing stride or adding padding (for example, 33 columns in 2D tiles).
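The bank rule (bank = word index mod 32) can be checked with a small host-side model. A deliberate simplification: it assumes 4-byte accesses to distinct words and ignores the hardware's broadcast of identical addresses (conflict_factor is an illustrative name):

```c
/* Worst-case serialization factor for one warp of 32 threads,
 * each accessing the 4-byte word at index word_idx[t].
 * bank = word index % 32; the factor is the largest number of
 * threads landing in one bank. 1 = conflict-free, 32 = fully serial. */
int conflict_factor(const unsigned word_idx[32]) {
    int per_bank[32] = {0};
    int worst = 0;
    for (int t = 0; t < 32; t++) {
        int bank = word_idx[t] % 32;
        if (++per_bank[bank] > worst)
            worst = per_bank[bank];
    }
    return worst;
}
```

Running it over the classic patterns shows the padding trick: a stride-1 row access is conflict-free, a column access through a 32-word-wide tile (stride 32) is fully serialized, and widening the tile to 33 words (stride 33) restores a factor of 1.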
__syncthreads() is a block-wide barrier, so every thread in the block must reach it. If one thread skips the barrier on a divergent path, behavior is undefined and often ends in a deadlock or stale reads.
For the full GPU treatment with interactive divergence and scheduler simulators, open the CUDA Guide.
Kernel memory: zones, buddy, and slab
The kernel splits physical memory into zones (DMA, DMA32, NORMAL, HIGHMEM). Buddy allocation manages page ranges, and slab allocation turns those pages into fixed-size kernel objects (for example task_struct, inode, and sk_buff).
RAM page map (changes every allocation)
Recent allocation trace
call
zone
bytes
pages
ram effect
Reading this panel: the buddy allocator manages whole pages, while slab caches carve those pages into fixed-size objects (for example kmalloc-64). The RAM page map changes after each allocation or free because ownership of pages moves between free lists, slab pages, and DMA-reserved regions. Why ZONE_DMA still exists: some devices can only DMA into low physical addresses (for example 24-bit limits), so the kernel reserves a low region where dma_alloc_coherent(..., GFP_DMA) is likely to succeed.
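The buddy allocator's split/merge bookkeeping reduces to one XOR: a free block of order k (2^k contiguous pages) merges only with the block whose page frame number differs in exactly bit k. A sketch of that arithmetic (function names are illustrative, not the kernel's):

```c
#include <stdint.h>

/* PFN of the buddy of a free block of order `order`. */
uint64_t buddy_pfn(uint64_t pfn, unsigned order) {
    return pfn ^ (1ULL << order);
}

/* Base PFN of the combined order+1 block after a merge. */
uint64_t merged_pfn(uint64_t pfn, unsigned order) {
    return pfn & ~(1ULL << order);
}
```

This is why buddy coalescing is O(1) per level: finding the merge partner is a bit flip, not a search of the free list.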
Syscall path lab: read/write/ioctl/mmap/fsync
Pick a syscall and trace the full path through trap handling, VFS, filesystem, page cache, block layer, driver, DMA, IRQ, and return to userspace. Compare control-heavy calls (ioctl), mapping calls (mmap), and durability calls (fsync/sync).
VFS (Virtual File System)
A thin indirection layer in the Linux kernel. Userspace calls read()/write()/ioctl() on a file descriptor; the syscall handler dereferences that fd to a struct file, which points to a struct file_operations vtable supplied by whichever driver or filesystem owns the inode. VFS itself does not read blocks; it dispatches. That is why the same read() can end up in ext4, in a ramdisk, or in a v4l2 camera driver.
Page cache
The kernel's RAM-resident copy of file data, indexed by (inode, offset). read() hits it on warm pages and skips the block layer entirely; write() marks pages dirty and returns before any disk I/O. fsync() is the call that actually forces dirty pages down through the block layer and the device FUA flush.
NVMe (Non-Volatile Memory Express)
NVMe is a queue-based storage protocol. The kernel posts commands into a submission queue, rings a doorbell register, the controller performs DMA to/from RAM, then raises an MSI-X interrupt on completion. In this panel, NVMe appears on read-miss, writeback, and fsync durability paths.
Why the page cache exists: DRAM access is far faster than storage I/O (nanoseconds versus microseconds or milliseconds). A page-cache hit avoids device latency. The kernel uses free RAM for cache and evicts under pressure. sync() writes back dirty pages globally, while fsync(fd) waits for one file to become durable.
Serial protocol waveforms and digital receive logic
Choose a protocol, edit one payload byte, and watch how receiver logic interprets clock edges into bits and bytes. Focus is digital-circuit behavior: sampling, shifting, framing, and commit to memory.
Protocol:
Byte:
Select a protocol to decode one frame.
Protocol engine data path (receiver view)
Clock and framing notes
MIPI CSI-2 camera pipeline
Lens → sensor → CSI-2 receiver → ISP → DMA → IOMMU → V4L2 queue → userspace. This view tracks where timing pressure appears and why frame drops occur.
Runtime pipeline state
CSI-2 packet path (simplified)
V4L2 buffer flow: one buffer is currently being filled by DMA, one is active in userspace, and the rest are queued. If userspace keeps a completed buffer too long, queue depth shrinks and CSI frames can drop with V4L2_BUF_FLAG_ERROR.
CPU ↔ GPU data path: pageable, pinned, and unified memory
A CUDA kernel can run only after input data is visible to the GPU. Transfer mode determines latency and throughput. Compare pageable, pinned, and unified memory to see where transfer time and ownership handoff are spent.
PCIe (Peripheral Component Interconnect Express)
The serial, packet-switched link between the CPU complex and the GPU. Transfers are TLPs (Transaction Layer Packets): MRd memory read, MWr memory write, Cpl/CplD completion with or without data. Each lane is a differential pair; Gen4 x16 ≈ 32 GB/s one way. Flow control is credit-based, so an overloaded receiver throttles the sender instead of dropping packets.
Doorbell / BAR
A BAR (Base Address Register) is a window into the GPU's MMIO space that the CPU can store to directly. A doorbell is a specific 32-bit register inside that window; writing to it tells the GPU "go look at the work queue you already know about." One store, one TLP, kernel launch begins.
Transfer mode:
Payload size:
Wall time: 0.000 ms · DMA bursts: 0 · Host copies: 0
Pageable host buffer: CUDA cannot DMA directly from memory that may be swapped out. The runtime first copies data into pinned staging memory, then GPU DMA reads from that staging buffer.
Firmware notes: when each mode is the right choice
Pageable memory is acceptable for one-off transfers such as startup weights or configuration blobs. The runtime first stages data through pinned memory, so profile with nsys if transfer time matters.
Pinned memory (cudaMallocHost) is best for repeated high-throughput transfers. It enables direct DMA and works well with cudaMemcpyAsync, but too much pinned memory can hurt the rest of the system because it cannot be swapped.
Unified memory (cudaMallocManaged) gives one pointer for CPU and GPU. On discrete GPUs, pages migrate on demand through UVM (Unified Virtual Memory) faults. On integrated SoCs (for example Jetson), CPU and GPU share DRAM, so this mode can avoid explicit copies.
Practical default: start with unified memory on Jetson, and start with pinned + async transfers on discrete GPUs.
An SM (Streaming Multiprocessor) is the unit that schedules warps and executes instructions. Click blocks to connect hardware units to practical performance outcomes: occupancy, memory-latency hiding, and shared-memory pressure.
Click a block in the SM diagram
Hover any sub-unit (warp scheduler, L1/Shared, tensor core, register file, FP32 lane bank, INT32 lane bank, LD/ST, SFU) to see a technical explanation.
L1/Shared split (configurable per kernel):
Why one SM matters to firmware engineers
GPU performance is easiest to debug one SM at a time: ask what this SM can issue each cycle and what stalls it.
4 warp schedulers: each scheduler picks ready warps. More ready warps means better latency hiding.
Register pressure: high registers-per-thread lowers occupancy, so fewer warps are available to hide memory latency.
L1/shared split: changing shared-memory carveout can help one kernel and hurt another.
Tensor/FP/INT pipelines: throughput depends on whether your instruction mix matches available units.
Firmware and driver teams meet this in profiling counters, launch tuning, and bring-up checks (reported SM count, cache sizes, and scheduling behavior).
Write-through, write-back, and write-combine on one timeline
Run the same store under three L1 policies. Compare when DRAM is written and what DMB ISH (Data Memory Barrier, Inner Shareable) forces to complete before later instructions continue.
WCB (Write Combine Buffer)
A small staging FIFO between the core and the coherent bus for regions marked as Normal Non-Cacheable or Device-GRE. Consecutive stores to the same cache-line-aligned range are merged into one wider burst, so four 32-bit writes to a GPU ring buffer become one 128-bit beat on AXI. The downside is ordering: a WCB can hold your store for cycles before anyone else sees it, which is why pushing a doorbell needs DMB ISHST (Data Memory Barrier, Inner Shareable, Store) to drain it.
Write-through vs write-back vs write-combine
WT (write-through) stores hit L1 and propagate to L2/DRAM immediately: simple and predictable, but higher traffic. WB (write-back) stores stay in L1 as dirty data until eviction or snoop: faster, but visibility depends on coherence. WC (write-combine) batches stores in WCB and flushes as bursts: useful for MMIO fills and GPU command buffers.
WT dram writes
0
WB dram writes
0
WC dram writes
0
WC buffer hits
0
Write-through: every store goes to L1 and memory at the same time. Simple, slow, burns DRAM bandwidth. Good for tiny instruction caches. Write-back + write-allocate: store fills the line into L1 on miss, dirties it, DRAM gets written later on eviction. One DRAM write per 64 bytes of stores to the same line. Write-combine: no allocation, no ordering within the WC buffer. Four adjacent stores coalesce into one burst. You must issue DMB ISH (Data Memory Barrier, Inner Shareable) before the device can observe the writes in program order. The GPU command ring and MIPI DSI command FIFO usually sit in WC regions.
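The write-combine accounting above can be modeled with a one-line WCB. This is a deliberate simplification (wc_bursts is an illustrative name; real WCBs hold several entries and may drain at any moment, which is exactly why the DMB is required):

```c
#include <stdint.h>
#include <stddef.h>

/* Teaching model of a one-entry write-combine buffer: consecutive
 * stores to the same 64-byte line merge into one burst; a store to
 * a different line, or the final drain (the DMB), flushes it. */
int wc_bursts(const uint32_t *addrs, size_t n) {
    int bursts = 0;
    uint32_t open_line = UINT32_MAX;      /* sentinel: buffer empty */
    for (size_t i = 0; i < n; i++) {
        uint32_t line = addrs[i] >> 6;    /* 64-byte line number */
        if (line != open_line) {
            if (open_line != UINT32_MAX)
                bursts++;                 /* flush the previous line */
            open_line = line;
        }
    }
    if (open_line != UINT32_MAX)
        bursts++;                         /* barrier drains the WCB */
    return bursts;
}
```

Four adjacent 32-bit stores cost one burst here versus four DRAM writes under write-through, and alternating between two lines defeats the combining entirely.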
DMA coherent vs streaming: same transfer goal, different ownership rules
Both modes move bytes between device and RAM. The difference is cache ownership. Coherent mappings avoid manual cache sync; streaming mappings need explicit handoff calls before and after DMA.
Coherent (dma_alloc_coherent): the pool is mapped as Device-nGnRnE or Normal-NC. nGnRnE means non-Gathering, non-Reordering, no Early write acknowledgment: writes are not merged, not reordered, and not acknowledged early. CPU stores bypass L1 and land in DRAM, so devices see fresh data without manual flushes. Cost: no cache, so CPU throughput is lower. Use this for small descriptor rings, mailboxes, and doorbell shadow data. Streaming (dma_map_single): the buffer is normal cacheable memory. Before the kick, dma_sync_single_for_device() cleans (and invalidates, if FROM_DEVICE) the lines. After the IRQ, dma_sync_single_for_cpu() invalidates stale lines so the CPU reads fresh DRAM. Miss either call and you get silent corruption. Bounce buffer: if the device DMA mask cannot reach the buffer's physical address, the DMA core copies through a low-memory staging buffer. You pay a memcpy per transfer, but the API is unchanged.
Video buffering timeline: writer pointer vs scanout pointer
In a healthy double-buffer pipeline, ISP writes the next frame into an idle buffer while display scans the current frame from a different buffer. At vsync, roles swap atomically.
displayed
0
dropped
0
tears
0
Read this as two moving pointers. Write pointer (ISP DMA) fills lines. Scan pointer (display DMA) reads lines for the panel. Double buffering keeps them on separate buffers, so no tearing. Single buffering lets them cross, so one frame can contain mixed old/new lines.
Condition variables for producer/consumer correctness
Use this when a thread should sleep until shared state changes. pthread_cond_wait releases the mutex and sleeps atomically, then re-locks before returning.
Why developers need this: busy-wait loops waste CPU, but sleeping without a condition protocol causes lost wakeups and race bugs. Always guard the wait with while(!pred) so every wakeup re-checks state under the mutex.
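The while(!pred) rule looks like this in pthreads. A minimal single-producer sketch (producer/consume and the value 42 are illustrative; a real queue would loop and track depth):

```c
#include <pthread.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int ready;                       /* the predicate, guarded by m */
static int data_value;

static void *producer(void *arg) {
    pthread_mutex_lock(&m);
    data_value = 42;                    /* publish state under the mutex */
    ready = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    return arg;
}

/* Sleeps until the predicate holds; re-checks after every wakeup. */
int consume(void) {
    pthread_t p;
    int v;
    pthread_create(&p, NULL, producer, NULL);
    pthread_mutex_lock(&m);
    while (!ready)                      /* guards against spurious wakeups */
        pthread_cond_wait(&cv, &m);     /* unlock + sleep + relock, atomically */
    v = data_value;
    pthread_mutex_unlock(&m);
    pthread_join(p, NULL);
    return v;
}
```

The while loop also covers the race where the producer signals before the consumer ever waits: ready is already 1, so the consumer never sleeps and no wakeup is lost.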
End-to-end: writel(1, BAR0 + GPIO_SET) lights an LED
Trace one memory-mapped write from instruction issue to pad toggle: memory type check, barrier ordering, interconnect transfer, SMMU (System Memory Management Unit) translation, GPIO decode, then output latch.
axi beats
0
smmu xlates
0
smmu faults
0
led state
off
Device-nGnRnE means Device, non-Gathering, non-Reordering, no Early write acknowledgment: each MMIO store is strongly ordered and not merged. DMB OSHST means Data Memory Barrier, Outer Shareable, Store; it forces prior outer-shareable stores to become visible before execution continues. The SMMU stage-2 walk then resolves the final GPIO physical address. If the SMMU context is disabled, translation faults occur and the LED stays off.
Hypervisor translation chain: GVA → IPA → HPA
In a guest VM, software at EL1 (Exception Level 1) translates GVA (Guest Virtual Address) to IPA (Intermediate Physical Address). Hardware controlled by EL2 (Exception Level 2) then translates IPA to HPA (Host Physical Address).
Stage-1 / Stage-2
Stage-1 uses TTBR0_EL1 (Translation Table Base Register 0, EL1) to walk guest page tables. Stage-2 uses VTTBR_EL2 (Virtualization Translation Table Base Register, EL2) to map guest physical space into host physical space. Hardware chains both walks; guest software cannot bypass stage-2.
Why DMA uses stage-2 too
When a PCIe device issues a DMA, the SMMU applies stage-1 (device driver's IOVA map) and then stage-2 (hypervisor's guest-physical map). This is how a passthrough NVMe can issue DMA with only guest-physical addresses without escaping the VM.
PCIe (Peripheral Component Interconnect Express) TLPs on the wire
CPU↔GPU traffic on PCIe is packetized. Send posted writes, send non-posted reads, and watch completions return. If receiver credits run out, traffic pauses until FC Update (Flow Control Update) replenishes credits.
TLP (Transaction Layer Packet)
The transport unit across PCIe. MWr is posted (fire-and-forget), MRd is non-posted (requester waits for CplD, short for Completion with Data). A doorbell is one MWr; a DMA descriptor fetch is one MRd + one CplD.
Credits
Each receiver advertises header and payload credit limits. The sender decrements local credit counters before transmit. If credits reach zero, transmission pauses. FC Update (Flow Control Update) DLLPs refill credits as receive buffers drain.
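The credit mechanism can be modeled in a few lines. A teaching sketch only (struct fc and the function names are illustrative; real PCIe tracks header and payload credits separately per TLP class, and stalled TLPs queue rather than being counted):

```c
/* Credit-based flow control: the sender transmits only while it holds
 * credits; FC Update DLLPs return credits as the receiver drains. */
struct fc {
    int credits;   /* credits currently held by the sender */
    int sent;      /* TLPs that made it onto the link */
    int blocked;   /* TLPs that had to wait for credits */
};

void fc_send(struct fc *f, int n_tlps) {
    for (int i = 0; i < n_tlps; i++) {
        if (f->credits > 0) {
            f->credits--;
            f->sent++;
        } else {
            f->blocked++;       /* link stalls: no drop, just backpressure */
        }
    }
}

void fc_update(struct fc *f, int returned) {
    f->credits += returned;     /* receiver drained buffers, credits refill */
}
```

Starting from the panel's 12 header credits, a 16-TLP burst sends 12 and stalls 4 until an FC Update arrives, which is the backpressure pattern the simulator shows.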
tlp sent
0
header credits
12
data credits (KB)
16
completions
0
Page faults: minor, major, COW, file-backed mmap
From userspace, all four faults look like a stalled load instruction. Inside the kernel, paths differ: minor faults map existing memory, major faults need I/O, COW faults duplicate shared pages, and file-backed faults pull from storage via page cache.
Minor vs major
Minor: the PTE (Page Table Entry) was not present, but the page was already in RAM (anonymous zero-fill, or file page still in page cache). Fix-up is a couple of microseconds. Major: the page was paged out to swap or was never read from disk. Fix-up requires I/O, typically milliseconds.
COW (copy-on-write)
After fork(), parent and child share anonymous pages read-only. The first write triggers a page fault with FSR.WnR=1 on a writable VMA (Virtual Memory Area) but a read-only PTE. The handler allocates a fresh page, copies data, updates the PTE writable, and retries the instruction.
Why write() returning is not on-disk: fsync, journal, FUA
A successful write() means bytes reached page cache, not durable media. Power-loss durability also needs writeback, journal commit, and device cache flush. Run each call path to see what actually happens.
Journal (jbd2)
Ext4 writes metadata changes into a circular log before touching the real inode blocks. If power is lost mid-write, recovery replays the log to restore consistency. fsync() forces both the data block and the metadata journal commit.
FUA / FLUSH
NVMe commands can carry a Force Unit Access bit that bypasses the device cache for durability, or be preceded by a Flush command that drains the cache. Without one of these, the device may return completion while bytes still sit in its volatile DRAM.